Origin3000 system and some arcane IRIX system 
calls that can dramatically improve scaling performance for 

within that SSI. The effort at NAS will extend the 
coherency domain to as many processors as practical. 
Larger systems result in increasing latencies as the 
number of “hops” from one end of the system to another 
grows. In general, this latency is approximately 40 ns per 
hop + 285 ns. The best-case hop count on any Origin3000 
system is 0 hops, or access to node local memory. 


1 Introduction 


Within the framework of a cooperative agreement 
between NASA and SGI, three (soon to be four) first-of- 
a-kind systems have been built and integrated into the 
NASA Advanced Supercomputing (www.nas.nasa.gov) 
Facility. Two of these machines run in the production 
supercomputing cluster (1) (a 256p and a 512p 
Origin2000). The third system, a 512-processor 
Origin3000, was brought up the second week of March 
2001. Two of these 512p systems will be combined later 
in the summer of 2001 to form the first 1024p single 
system image supercomputer. 

The primary goal of the project is to provide reliable 
platforms to run highly parallel applications that require 
tightly coupled processors. To that end, several 
elements are discussed that will significantly affect the 
outcome: 

Interconnect/Topology - An interconnect topology that 
minimizes latency while providing ample bisection 
bandwidth. 

Processing Elements - A processor that is competitive in 
sustained performance. 

Programming Methodology - The development of a 
programming methodology that will allow efficient use of 
the machine. 


The NUMA design is quite flexible in that many different 
physical topologies can be constructed which are 
optimized for different design goals. As such, the 
1024-processor system will be built in at least 2 different 
configurations. Consideration of additional factors beyond 
latency and bandwidth, such as how the system will 
function in a degraded mode, footprint, maintainability and 
cost, will be weighed prior to moving forward with the final 
topology in late summer 2001. 

At least three topologies have been proposed for the 
1024-processor system. 

Quad-Bristled Hypercube - This topology optimizes for 
cost by providing the minimum amount of interconnect 
hardware required. It can also be built today with no 
modification to the low level PROM code that discovers 
the physical layout of the machine at boot time. It suffers 
from a limited bisection bandwidth of 25 
megabytes/second/processor and a higher worst-case 
latency of 12 hops. Its primary advantage is that it 
should work “out of the box”. 


Application Performance - Demonstration of sustained 
performance and scaling on a collection of important 
applications. 

System Availability - Reliability such that failures are 
infrequent enough so as not to interfere with day to day 
production use of the system. 

2 Interconnect/Topology 

The Origin3000 system is a cache-coherent non-uniform 
memory access (ccNUMA) computer. That is, all 
memory within a Single System Image (SSI) is globally 


QuadraTree (2) - This topology, proposed by Ekechi 
Nwokah (nwokah@sgi.com), optimizes for latency by 
minimizing the radius of the system, a worst case of 6 
hops, while providing a significant increase in bisection 
bandwidth over the Quad-Bristled Hypercube. It also 
provides for tightly integrated groups of 64 cpus for 
optimal OpenMP scaling. 

Quad-Bristled Fat Tree (2) - This topology optimizes for 
bandwidth, while sacrificing some latency, 7 hops worst 
case, with a bisection bandwidth double that of the 
QuadraTree. It provides for tightly integrated groups of 
12 cpus for optimal OpenMP scaling. 
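
As a rough worked example, the worst-case hop counts quoted for the three proposed topologies can be plugged into the latency estimate given for this system. The sketch below assumes that estimate reads as roughly 40 ns per hop plus a fixed 285 ns; the function and dictionary names are illustrative only and are not part of any SGI tool.

```python
# Rough latency model: a fixed cost plus a per-hop cost.
# Assumption: the estimate reads as 40 ns per hop + 285 ns.
NS_PER_HOP = 40
BASE_NS = 285


def memory_latency_ns(hops):
    """Approximate memory latency in nanoseconds for a given hop count."""
    if hops < 0:
        raise ValueError("hop count cannot be negative")
    return NS_PER_HOP * hops + BASE_NS


# Worst-case hop counts quoted for the proposed 1024-processor topologies.
WORST_CASE_HOPS = {
    "Quad-Bristled Hypercube": 12,
    "QuadraTree": 6,
    "Quad-Bristled Fat Tree": 7,
}

if __name__ == "__main__":
    print(f"node-local memory (0 hops): {memory_latency_ns(0)} ns")
    for name, hops in WORST_CASE_HOPS.items():
        print(f"{name}: {hops} hops -> {memory_latency_ns(hops)} ns")
```

Under this model a worst-case remote reference costs about 765 ns on the hypercube, against 525 ns on the QuadraTree and 565 ns on the fat tree, compared with 285 ns for node-local memory.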
Initially the system will be brought up in the Quad-Bristled 
Hypercube configuration in order to test basic 
functionality of the system and to work through any 
difficulties with the Operating System. As much as 
practical, tests will be conducted to determine which 
alternate topologies will provide the greatest benefit to 
applications. Any alternative topology will require that 
modifications be made to the PROM code. 


effectively managed up to 128 processors. Indeed, 
managing the memory traffic on 1024-way parallel 
applications will be critical in achieving high sustained 
rates of performance. Based on previous results for 
Origin2000 systems, a performance target of 20% of 
peak (or at least 220 gigaflops) on a real world 
Computational Fluid Dynamics (CFD) design problem 
should be achievable once the system is upgraded from 

3 Processing Elements 

SpecFP numbers for single CPU performance indicate 
that SGI/MIPS processors are losing ground to other 
challengers in the marketplace (figure 1 - current as of 
May, 2001). 


Figure 1: SpecFP 2000, harmonic mean of base ratios. 
Systems shown: SN0 MIPS R12k 400 MHz, IBM P3-2 450 MHz, 
SN1 MIPS R12k 400 MHz, Sun US3 900 MHz, HP PA-8600 
552 MHz, SN1 MIPS R14k 500 MHz, AMD Athlon 1.3 GHz, 
Alpha 21264A 667 MHz, Intel P4 1.4 GHz, HP PA-8700 
750 MHz, Fujitsu P4 1.5 GHz, Alpha 21264B 833 MHz, 
Intel P4 1.7 GHz. 

However, this view can be misleading in that the trend in 
high end computing is moving away from single 
processor clusters towards collections of multi-processor 
systems, presumably because of the difficulty in scaling 
single processor message passing codes to hundreds or 
thousands of processors. The decision most computing 
centers face is not whether SMPs will be used, but how 
many processors the SMP will have. The performance 
picture changes significantly when looking at the 
SpecFP Throughput numbers (figure 2), which provide 
a more realistic representation of how a large-scale 
system will perform under load, when all the processors 
are used simultaneously, memory references make it 
beyond the cache, and cooperating processes 
communicate with one another. One can conclude from 
this benchmark that higher sustained rates of 
performance are achieved via a more robust design of 
the memory system and that memory demand can be 


Figure 2: SpecFP 2000 Throughput. 


4 Programming Methodology 

Shared memory has many benefits over explicit 
messaging. Some of those benefits include: 


1. Decomposition of the problem is much easier, 
and more flexible than that required in an explicit 
messaging code. 

2. The availability of a larger number of feasible 
parallelization strategies. 

3. Large sections of code require no modification to 
port from earlier non-parallel or vector systems. 
For instance, sections of code that deal with I/O 
may require little or no modification to work 
properly. 

4. The average communication latency is less than 
other messaging approaches. 


5. Fewer modifications are required to implement 
the parallelism. 

6. Load balancing is much easier to manage and 
can be dynamic. 


disadvantage in the literature relates to the observed 
scaling of OpenMP. This is commonly (and incorrectly) 
extrapolated to poor performance of shared memory 
systems in general. The success in achieving high levels 
of parallelism on shared memory systems (i.e., greater 
than 100-way parallel) has centered on the development 
of a technique conceived of by Jim Taft 
(jtaft@nas.nasa.gov) known as Multi Level Parallelism 
(MLP). The MLP technique requires high-level coarse-grain 
decomposition similar to that for MPI, but 
communicates between these high level processes via 
shared memory instead of explicit messages. This 
minimizes the communication latency involved by 
eliminating any software protocol overhead as well as 
allowing any available hardware latency hiding 
mechanisms (e.g., cached writes, outstanding memory 
references, out of order execution, etc.) to function. 
OpenMP (5) is then used to parallelize each of these high 
level decomposed pieces. 

In many cases, however, successful application of the 
MLP technique requires the use of arcane, poorly 
documented, and undocumented system calls that must 
be made to manage thread placement and memory 
allocation. Thankfully, the MLP library hides this from the 
user. These calls and others were implemented in the 
MLP library by a local SGI site analyst, Bron Nelson 
(bron@sgi.com), whose understanding and knowledge 
of IRIX made possible the success of this approach. 

An early version of the MLP library contained a call to a 
routine called PID_TO_NODE written by John 
Richardson of SGI. The routine creates a memory 
locality domain (MLD) and associates it with the process 
id of the calling process. This has the effect of advising 
the kernel that pages allocated by this process are 
allocated on a specific node, and that this process is run 
on a CPU attached to that node. In MLP terminology, 
this is referred to as “pinning” or “PIN to Node”. With the 
release of IRIX 6.5.10, some of the interfaces that these 
low level routines used were changed. Although the MLP 
library still worked, “pinning” did not. This resulted in 
temporarily losing performance on the Origin3000 
systems. 
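
The IRIX interfaces described here are not available on other systems. As a loose modern analogue of the CPU half of pinning only, the sketch below uses Python's `os.sched_setaffinity`, which exists on Linux. It does not control where memory pages are allocated, it is not the MLP library's method, and the helper name `pin_to_cpu` is made up for illustration.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to one CPU, if the platform allows it.

    Returns the resulting affinity set, or None where the call is
    unavailable (os.sched_setaffinity is Linux-specific) or refused.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)      # 0 means "the calling process"
    if cpu not in allowed:
        cpu = min(allowed)                 # fall back to a permitted CPU
    try:
        os.sched_setaffinity(0, {cpu})
    except OSError:
        return None
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    has_affinity = hasattr(os, "sched_getaffinity")
    before = os.sched_getaffinity(0) if has_affinity else None
    print("pinned to:", pin_to_cpu(0))
    if before is not None:
        os.sched_setaffinity(0, before)    # restore the original mask
```

Unlike the memory locality domain described above, this only fixes where the process runs; page placement then depends on the operating system's default policy.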

5 Application Performance 

At least three major applications have been ported to the 
MLP programming paradigm: INS3D (8,11), an unsteady 
CFD code used for incompressible problems such as the 
design of the Space Shuttle main engine; the Data 
Assimilation System (9) (DAS), a major NASA climate 
research code based in part on CCM3; and 
OVERFLOW (7,10), a 3D compressible CFD code used 
commonly throughout NASA and industry. All three 
applications show good scaling and performance well 
into the low to mid 100s of processors and all three 
depend upon MLP techniques to achieve this. 
Additionally, “pinning” (as discussed below) is required 
by all to achieve the best performance. 


5.1 OVERFLOW - MLP 

The OVERFLOW CFD code is available from NASA. It is 
maintained and distributed by Pieter Buening 
(pbuening@larc.nasa.gov) at NASA Langley. Subject to 
the NASA guidelines for technology export, one can 
contact Pieter and obtain a copy of OVERFLOW-MLP; 
be sure to specifically ask for the MLP version. The MLP 
version was developed by Jim Taft (jtaft@nas.nasa.gov) 
and can sustain over 60 gigaflops on the 512 
processor Origin2000. The MLP version maintains all the 
physics options of the standard release, but has the 
additional MLP functionality that includes both a runtime 
and dynamic load balancing capability. 

Soon after the 512p Origin3000 system was brought on-line, 
Dennis Jespersen (jesperse@nas.nasa.gov) 
grabbed the latest and greatest MLP OVERFLOW 
source, and ran a 32.2 million grid point problem with 
150 zones (figure 3). The figure shows the amount of 
time a single time step takes. As shown, the scaling on 
OVERFLOW basically stops at 64 processors. 

Figure 3: Overflow - Transport Configured for Landing, 
32 Million Points/150 zones. 

It was known that the pinning code did not work, and 
Bron Nelson (bron@sgi.com) then set off to create a 
new and improved version of PID_TO_NODE for IRIX 
6.5.10 and above called MP_ASSIGN_TO_CPU. The 
interface to this code is very simple and elegant, but 
what it does is crucial to scalability, and how it does it is 
not well known outside of SGI engineering. 

5.2 Pin to Node via Mp_assign_to_cpu.c 

Mp_assign_to_cpu.c is called from the forkit.f 
routine in the MLP library (the MLP library is available 
from Jim Taft of Sienna Software (jtaft@nas.nasa.gov)). 
The program is single threaded up to this point in the 
execution of the code. After the call to forkit, numpro 
MLP processes will exist. In the call to this version of 
forkit.f, nthread is an array of integers, each of which 
defines the number of threads in each of the MLP 
processes, numpro is an integer that represents the 
number of MLP processes to create, and nowpro returns 
to the newly created MLP process its unique MLP id 
number: 

      subroutine forkit(nthread, numpro, nowpro)

c     count offsets to starting cpus

      ival = 0
      do n = 1, numpro
         istart(n) = ival
         ival = ival + nthread(n)
      enddo

c     spawn the threads - manual forks
c     (master is assumed to hold the pid of the original process
c     and nowpro to start at 1; neither is shown in this excerpt)

      do n = 2, numpro
         nowpid = getpid()
         if (nowpid .eq. master) then
            ierror = fork()
         endif
         nowpid = getpid()
         if (nowpid .ne. master) then
            nowpro = n
            go to 200
         endif
      enddo

  200 call omp_set_num_threads(nthread(nowpro))

c     pin to cpus

!$omp parallel
      call mp_assign_to_cpu(omp_get_thread_num(),
     &                      istart(nowpro))
!$omp end parallel


As shown, each thread of each MLP process makes a 
call to mp_assign_to_cpu.c. As stated earlier, 
mp_assign_to_cpu.c makes some interesting system 
calls roughly characterized by the following sequence of 
events: 


1. Figure out which physical memory each cpu is 
attached to: 

sysmp(MP_NUMA_GETCPUNODEMAP, 
      (void *)cpu_node_mapping, sizeof(cnodeid_t) * ncpus) 

2. Figure out which physical processors the program 
has access to by looking at the CPUSET it is 
running in. Also look at the NODEMASK to see which 
nodes are available. Map the NODEMASK nodes to 
CPUs and take the logical AND of the CPUSET and 
NODEMASK. Create an array from 1 to n of 
available cpus to assign threads to: 

pmoctl(PMO_GETNODEMASK_UINT64, &sys_nodemask, 
       sizeof(sys_nodemask)) 
req.request = CPUSET_QUERY_CPUS; 
sysmp(MP_CPUSET, &req) 

3. Demand that physical pages be allocated from the 
memory attached to the CPU assigned to this 
thread: 

mld_create(1, 1024) 
mldset_create(&mld, 1) 
mldset_place(mldset, TOPOLOGY_PHYSNODES, 
             ..., RQMODE_MANDATORY); 

4. Lock the thread that called this routine to a cpu: 

sysmp(MP_MUSTRUN, cpulist[my_rank+starting_point]) 
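
The bookkeeping in steps 1, 2 and 4 can be sketched in portable form. This is a minimal simulation under stated assumptions, not the IRIX implementation: the CPU-to-node map, cpuset and nodemask are plain Python values rather than results of `sysmp` or `pmoctl`, the array is 0-based rather than 1 to n, and both function names are hypothetical.

```python
def available_cpus(cpu_to_node, cpuset, nodemask):
    """CPUs that are both in the cpuset and attached to an allowed node.

    cpu_to_node: list where entry i is the node that CPU i is attached to
    cpuset:      set of CPU numbers the job may run on
    nodemask:    set of node numbers the job may allocate memory from
    """
    return sorted(
        cpu for cpu in cpuset
        if 0 <= cpu < len(cpu_to_node) and cpu_to_node[cpu] in nodemask
    )


def cpu_for_thread(cpulist, my_rank, starting_point):
    """CPU a thread is locked to: its rank offset by the process start."""
    index = my_rank + starting_point
    if index < 0 or index >= len(cpulist):
        raise IndexError("thread does not map onto an available CPU")
    return cpulist[index]


if __name__ == "__main__":
    # Hypothetical 16-CPU machine with 4 CPUs per node (made-up layout).
    cpu_to_node = [cpu // 4 for cpu in range(16)]
    cpulist = available_cpus(cpu_to_node, cpuset=set(range(4, 16)),
                             nodemask={1, 2})
    print(cpulist)                        # CPUs 4-11 survive the intersection
    print(cpu_for_thread(cpulist, my_rank=1, starting_point=4))
```

Step 3, the mandatory placement of pages on the node that owns the chosen CPU, has no counterpart in this sketch; it only shows how the list of candidate CPUs is narrowed and indexed.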

The effect of the mp_assign_to_cpu call is shown 
graphically in figure 4 if the forkit routine were called with 
the following: nthread=[8,13,16], numpro=n 


Figure 4: MLP #1 ... MLP #n 



For example, OMP thread 1 of MLP process 2 should 
always run on CPU 9. Any memory allocated by this 
thread is placed on Node 2, the closest memory to CPU 
9. This has the effect of improving locality. The 
technique is applicable to pure OpenMP codes, MPI 
codes, and multi-level MPI codes as well. Its 
effectiveness is dependent on the application and the 
number of CPUs in use, and it has demonstrated a 
dramatic effect on several codes worked on at NAS. 
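
The CPU 9 / Node 2 example can be reproduced with a few lines of arithmetic. The sketch below assumes the example refers to the second MLP process with nthread beginning 8, 13, 16, that CPUs and OpenMP threads are numbered from 0, that every CPU is available, and that each node holds 4 consecutive CPUs. The function names are illustrative and the 4-CPUs-per-node figure is an assumption, not something stated in the text.

```python
def start_offsets(nthread):
    """Exclusive running total: first CPU slot used by each MLP process."""
    offsets, total = [], 0
    for count in nthread:
        offsets.append(total)
        total += count
    return offsets


def placement(nthread, process, thread, cpus_per_node=4):
    """Return (cpu, node) for a 0-based thread of a 1-based MLP process.

    Assumes CPUs are numbered from 0, all CPUs are available, and each
    node holds cpus_per_node consecutive CPUs (4 is an assumption).
    """
    if not 1 <= process <= len(nthread):
        raise ValueError("no such MLP process")
    if not 0 <= thread < nthread[process - 1]:
        raise ValueError("thread index out of range for this process")
    cpu = start_offsets(nthread)[process - 1] + thread
    return cpu, cpu // cpus_per_node


if __name__ == "__main__":
    nthread = [8, 13, 16]
    print(start_offsets(nthread))
    print(placement(nthread, process=2, thread=1))
```

With these inputs the first process occupies CPUs 0 through 7, so thread 1 of the second process lands on CPU 9, which under the 4-per-node assumption sits on node 2.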

Figures 5 and 6 show the effect of the pin code vs the 
non-pin code (seconds per time step in figure 5 and 
speedup over 32 cpus in figure 6). As shown, scaling is 
significantly improved with gains all the way to 480 cpus. 
It is worthwhile to note that the load balancing for this 
problem was computed by the application as part of its 
initialization. This is quite significant and represents 
another big advantage of shared memory systems - 
load balancing is flexible and dynamic. With hand 
optimization of the load balancing, better speedup 
efficiencies can be achieved. 
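
For readers who want to derive a speedup curve from seconds-per-time-step measurements, the relationship can be sketched as follows. The timings in the example are made up purely to exercise the functions; they are not taken from the figures, and the function names are illustrative.

```python
def speedup_over_baseline(t_base, t_n):
    """Speedup relative to the baseline run (baseline time / this time)."""
    if t_base <= 0 or t_n <= 0:
        raise ValueError("timings must be positive")
    return t_base / t_n


def parallel_efficiency(t_base, t_n, n_base, n):
    """Achieved speedup divided by the ideal speedup n / n_base."""
    return speedup_over_baseline(t_base, t_n) / (n / n_base)


if __name__ == "__main__":
    # Hypothetical seconds per time step, keyed by cpu count (not measured).
    timings = {32: 16.0, 64: 8.0, 128: 5.0}
    base_cpus = 32
    for cpus, seconds in timings.items():
        s = speedup_over_baseline(timings[base_cpus], seconds)
        e = parallel_efficiency(timings[base_cpus], seconds, base_cpus, cpus)
        print(f"{cpus:4d} cpus: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency of 1.0 means time per step halves each time the cpu count doubles from the 32-cpu baseline; values below 1.0 show how much of the added hardware is not turning into shorter steps.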




Figure 5: Overflow - Transport Configured for Landing, 
32 Million Points/150 zones, No Pin vs Pin (O3000). 



Figure 6: Speedup over 32 cpus, No Pin vs Pin; x-axis 
#CPUs (32, 64, 96, 120, 240, 256, 384, 480). 


6 System Availability 


7 Conclusion 

Progress continues to be made in making large scale 
shared memory systems work and in achieving high 
levels of performance on important NASA applications. 
To scale optimally beyond 64 cpus, pinning techniques 
may be required. 